{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "banned-survey",
   "metadata": {},
   "source": [
    "# 数据预处理（csv文件）\n",
    "\n",
    "视频地址：https://www.bilibili.com/video/BV1CV411Y7i4?spm_id_from=333.999.0.0\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "acquired-charlotte",
   "metadata": {},
   "source": [
    "## 创建一个人工数据集，存储在csv文件里\n",
    "\n",
    ">`import os`\n",
    "\n",
    ">创建路径：\n",
    ">>代码：`os.path.join('文件夹名', '文件夹名')`  \n",
    ">>说明：该方法可以直接将文件夹连接成路径，不需要打/\n",
    "\n",
    ">递归创建文件夹目录：\n",
    ">>代码：`os.makedirs('创建文件夹路径', exist_ok=True)`\n",
    "\n",
    ">写入csv文件:\n",
    ">>代码：  \n",
    ">>`with open('csv文件路径', '写方式') as 指针: `  \n",
    ">> &nbsp;&nbsp;&nbsp;&nbsp;`指针.write('元素,元素,元素\\n')`\n",
    ">>\n",
    ">>说明：第0行的写入会作为csv文件的列索引名称标签，元素`NA`为空值  \n",
    ">>写方式参数：  \n",
    ">>>w: 打开一个文件只用于写入。文件已存在则将其覆盖。文件不存在，创建新文件  \n",
    ">>>a: 打开一个文件用于追加。文件已存在，文件指针将会放在文件的结尾。\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "crazy-berry",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "os.makedirs(os.path.join('.', '2.data'), exist_ok=True)\n",
    "data_file = os.path.join('.', '2.data', 'house_tiny.csv')\n",
    "\n",
    "# scv 数据用,逗号隔开\n",
    "with open(data_file, 'w') as f:\n",
    "    f.write('NumRooms,Alley,price\\n') # 列名\n",
    "    f.write('NA,pava,127500\\n') # 每行一个数据样本\n",
    "    f.write('2,NA,106000\\n')\n",
    "    f.write('4,NA,178100\\n')\n",
    "    f.write('NA,NA,140000\\n')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "exposed-council",
   "metadata": {},
   "source": [
    "## 使用pandas读取csv数据集\n",
    "\n",
    ">`import pandas as pd`\n",
    "\n",
    ">导入csv文件数据：\n",
    ">>代码：`数据集变量 = pd.read_csv('csv文件路径')`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "twelve-sacrifice",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   NumRooms Alley   price\n",
      "0       NaN  pava  127500\n",
      "1       2.0   NaN  106000\n",
      "2       4.0   NaN  178100\n",
      "3       NaN   NaN  140000\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "import os\n",
    "\n",
    "\n",
    "data_file = os.path.join('.', '2.data', 'house_tiny.csv')\n",
    "data = pd.read_csv(data_file)\n",
    "print(data)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "hispanic-syracuse",
   "metadata": {},
   "source": [
    "## 处理缺失数据NA\n",
    ">选取数据：\n",
    ">>代码：`数据集变量.iloc[行区域, 列区域]`  \n",
    ">>代码：`数据集变量.loc['行索引名称标签', '列索引名称标签' ]`  \n",
    ">>说明：对于本csv文件而言，`数据集变量.loc[:, 'NumRooms'] == 数据集变量.iloc[:, 0]`\n",
    "\n",
    ">填充数值域缺失值:\n",
    ">>用已有数据平均值填充：`数据集变量 = 数据集变量.fillna(数据集变量.mean())`  \n",
    "\n",
    ">将文本数据变为数值（非数值域缺失处理）:\n",
    ">>代码：`数据集变量 = pd.get_dummies(数据集变量, dummy_na=是否将缺失值分类True/False)`  \n",
    ">>说明：将缺失的文本NA分为一个类"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "alert-accident",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "第0、1列原始数据作为input：\n",
      "   NumRooms Alley\n",
      "0       NaN  pava\n",
      "1       2.0   NaN\n",
      "2       4.0   NaN\n",
      "3       NaN   NaN\n",
      "\n",
      "\n",
      "数值域缺失值处理，缺失值被已有数据的平均值填充：\n",
      "   NumRooms Alley\n",
      "0       3.0  pava\n",
      "1       2.0   NaN\n",
      "2       4.0   NaN\n",
      "3       3.0   NaN\n",
      "非数值域缺失值处理，将文本数据变为数值：\n",
      "   NumRooms  Alley_pava  Alley_nan\n",
      "0       3.0           1          0\n",
      "1       2.0           0          1\n",
      "2       4.0           0          1\n",
      "3       3.0           0          1\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "inputs = data.iloc[:, 0:2]  # 取出每行的第0、1列 作为input\n",
    "outputs = data.iloc[:, 2] ## 取出每行的第2列 作为output\n",
    "print(\"第0、1列原始数据作为input：\")\n",
    "print(inputs)\n",
    "print('\\n')\n",
    "\n",
    "# 缺失值处理\n",
    "inputs = inputs.fillna(inputs.mean())\n",
    "print(\"数值域缺失值处理，缺失值被已有数据的平均值填充：\")\n",
    "print(inputs)\n",
    "\n",
    "inputs = pd.get_dummies(inputs, dummy_na=True)\n",
    "print(\"非数值域缺失值处理，将文本数据变为数值：\")\n",
    "print(inputs)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "prospective-norfolk",
   "metadata": {},
   "source": [
    "## 将csv条目转化为张量\n",
    ">`import torch`\n",
    "\n",
    ">csv数据转换为torch数据格式：\n",
    ">>代码：`torch.tensor(数据集变量.values)`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "legal-vacation",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "inputs原始数据\n",
      "   NumRooms  Alley_pava  Alley_nan\n",
      "0       3.0           1          0\n",
      "1       2.0           0          1\n",
      "2       4.0           0          1\n",
      "3       3.0           0          1\n",
      "转换后数据\n",
      "tensor([[3., 1., 0.],\n",
      "        [2., 0., 1.],\n",
      "        [4., 0., 1.],\n",
      "        [3., 0., 1.]], dtype=torch.float64)\n",
      "\n",
      "\n",
      "outputs原始数据\n",
      "0    127500\n",
      "1    106000\n",
      "2    178100\n",
      "3    140000\n",
      "Name: price, dtype: int64\n",
      "转换后数据\n",
      "tensor([127500, 106000, 178100, 140000])\n"
     ]
    }
   ],
   "source": [
    "import torch\n",
    "\n",
    "x, y = torch.tensor(inputs.values), torch.tensor(outputs.values)\n",
    "\n",
    "# inputs数据\n",
    "print(\"inputs原始数据\")\n",
    "print(inputs)\n",
    "print(\"转换后数据\")\n",
    "print(x)\n",
    "\n",
    "print('\\n')\n",
    "\n",
    "print(\"outputs原始数据\")\n",
    "print(outputs)\n",
    "print(\"转换后数据\")\n",
    "print(y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "surprised-elimination",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
